External Search Engine Mining
Authors
Abstract
Search engines maintain large amounts of valuable data, such as web content, user queries, clicks, and browsing trails. This data is fully accessible only to the search engines themselves. Other parties, like users, advertisers, and researchers, have very limited access to the data via public interfaces provided by search engines (e.g., the search interface). External techniques for mining search engine data are rare and under-developed. Such external methods, which rely only on public interfaces, are appealing since they can be used by anyone, without relying on the goodwill of search engines. External mining can be used by search engine users and partners to objectively benchmark the quality of the service they get, and by researchers to compare search engines and study properties of the web. Even search engines themselves may benefit from external mining, as it can help them reveal their strengths and weaknesses relative to their competitors. In this work we propose a comprehensive framework for externally mining search engine indices and query logs. We designed an algorithm for estimating index properties, such as index size and freshness, language/domain/topic composition, density of spam, etc. We developed methods for sampling user queries from search engine query logs and for estimating query frequency (or popularity) in the logs. Finally, we designed an algorithm for estimating the visibility of a given web page in a search engine and for extracting the specific queries on which it is most visible. Our algorithms make extensive use of tools from statistics (Monte Carlo methods), information retrieval, and databases. The correctness and efficiency of the algorithms were analyzed both theoretically and empirically. The empirical analysis relies on a synthetic search engine we built locally and on real commercial search engines.
Similar resources
External Plagiarism Detection based on Human Behaviors in Producing Paraphrases of Sentences in English and Persian Languages
With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing methods suffer from lack of producing accurate queries, Precision and Speed of retrieved result...
Exploring synonyms within large commercial site search engine queries
Julia Kiseleva, Andrey Simanovsky. HP Laboratories, HPL-2011-41. Keywords: synonym mining, query log analysis. We describe results of experiments on extracting synonyms from a large commercial site search engine query log. Our primary object is product search queries. The resulting dictionary of synonyms can be plugged into a search engin...
I Data Mining Techniques and Analysis of Concept Based User Profiles from Search Engine Logs
Search engine logs are emerging new type of data user profiling component of any personalization interesting opportunities for data mining. Early user profiling work on mining data mostly attempted to discover knowledge at the level of queries based on objects that users are interested in positive preferences but not the objects in negative preferences. In our paper we focus on search engine lo...
Exploratory Patent Search with Faceted Search and Configurable Entity Mining
Searching for patents is usually a recall-oriented problem and depending on the patent search type, quite often a problem which is characterized by uncertainty and evolution or change of the information need. We propose an exploratory strategy for patent search that exploits the metadata already available in patents in addition to the results of clustering and entity mining that are performed a...
Landmark-Based Navigation and Path-Finding in Web Mining
We introduce a new method of Web relationship based on landmark identification and association. This allows us to improve the performance of searching relevant information in Web intelligent navigation. In order to be more convenient for users to find and search Web information, we have developed an EMS Search Engine as opposed to the traditional search engines. The operations of EMS Search Engine...